Abstract


In this paper, we propose a model-based clustering method (TVClust) that robustly 
incorporates noisy side information as soft-constraints and aims to seek a consensus between 
side information and the observed data. Our method is based on a nonparametric Bayesian 
hierarchical model that combines a probabilistic model for the data instances with one 
for the side-information. An efficient Gibbs sampling algorithm is proposed for posterior 
inference. Using the small-variance asymptotics of our probabilistic model, we derive a 
new deterministic clustering algorithm (RDP-means). It can be viewed as an extension of 
K-means that allows for the inclusion of side information and has the additional property 
that the number of clusters does not need to be specified a priori. We compare our work 
with many constrained clustering algorithms from the literature on a variety of data sets 
and conditions, such as noisy side information and erroneous k values. Our experiments
show that both our probabilistic and deterministic approaches perform strongly
under these conditions when compared to other algorithms in the literature.

Keywords: Constrained Clustering, Model-based methods, Two-view clustering, Asymptotics,
Non-parametric models.
1. Introduction


We consider the problem of clustering with side information, focusing on the type of side 
information represented as pairwise cluster constraints between any two data instances. For 
example, when clustering genomics data, we could have prior knowledge on whether two 
proteins should be grouped together or not; when clustering pixels in an image, we would 
naturally impose spatial smoothness in the sense that nearby pixels are more likely to be 
clustered together. 


Side information has been shown to provide substantial improvement on clustering. For
example, Jin et al. (2013) showed that combining additional tags with image visual features
offered substantial benefits to information retrieval and Khoreva et al. (2014) showed
that learning and combining additional knowledge (must-link constraints) offers substantial
benefits to image segmentation.


Despite the advantages of including side information, how to best incorporate it remains
unresolved. Side information in real applications is often noisy, as it is usually
based on heuristic and inexact domain knowledge; it should not be treated as the ground
truth, which further complicates the problem.


In this paper, we approach incorporating side information from a new perspective. We
model the observed data instances and the side information (or constraints) as two sources
of data that are independently generated by a latent clustering structure, hence we call our
probabilistic model TVClust (Two-View Clustering). Specifically, TVClust combines the
mixture of Dirichlet Processes of the data instances and the random graph of constraints.


We derive a Gibbs sampler for TVClust (Section 3). Furthermore, inspired by Jiang et al.
(2012), we scale the variance of the aforementioned probabilistic model to derive a
deterministic model. This can be seen as a generalization of K-means to a nonparametric number
of clusters that also uses instance-level side information (Section 4). Since it is based on
the DP-means algorithm (Jiang et al., 2012) and uses relational side information, we call
our final algorithm Relational DP-means (RDP-means). Lastly, experiments and results
are presented (Section 5), in which we investigate the behavior of our algorithm in different
settings and compare it to existing work in the literature.


2. Related Work 


There has been a plethora of work that aims to enhance the performance of clustering via
side information, either in deterministic or probabilistic settings. We refer the interested
reader to existing comprehensive literature reviews of this subarea such as Basu et al. (2008).


K-means with side information: Some of the earliest efforts to incorporate instance-level
constraints for clustering were proposed by Wagstaff & Cardie (2000) and Wagstaff
et al. (2001). In these papers, both must-link and cannot-link constraints were considered in
a modified K-means algorithm. A limitation of their work is that the side information must
be treated as the ground-truth and is incorporated into the models as hard constraints.


Other algorithms similar in nature to K-means that incorporate soft constraints have been
proposed as well. These include MPCK-means (Bilenko et al., 2004), Constrained
Vector Quantization Error (CVQE) (Pelleg & Baras, 2007) and its variant Linear Constrained
Vector Quantization Error (LCVQE) (Pelleg & Baras, 2007).¹ Unlike these approaches,
our algorithm is derived from using small variance asymptotics on our probabilistic model
and therefore is derived in a more principled fashion. Moreover, our deterministic model
doesn't require as input the goal number of clusters, as it determines this from the data.

Probabilistic clustering with side information: Motivated by enforcing smoothness
for image segmentation, Orbanz & Buhmann (2008) proposed combining a Markov
Random Field (MRF) prior with a nonparametric Bayesian clustering model. One issue
with their approach is that, by its nature, the MRF can only handle must-links but not
cannot-links. In contrast, our model, which is also based on a nonparametric Bayesian clustering
model, can handle both types of constraints.

Spectral clustering with side information: Following the long line of work on
spectral clustering techniques (e.g. Ng et al. (2002); Shi & Malik (2000)), there are some
works using these techniques with side information, mostly differing in how the Laplacian
matrix is constructed or in the relaxations of the objective functions. These works
include Constrained Spectral Clustering (CSR) (Wang & Davidson, 2010) and Constrained
1-Spectral Clustering (C1-SC) (Rangapuram & Hein, 2012).

Supervised clustering: There has been considerable interest in supervised clustering,
where there is a labeling for all instances (Finley & Joachims, 2005; Zhu et al., 2011) and
the goal is to create uniform clusters with all instances of a particular class. In our work, we
aim to use side cues to improve the quality of clustering, making full labeling unnecessary
as we can also make use of partial and/or noisy labels.

Non-parametric K-means: There has been recent work that bridges the gap between
probabilistic clustering algorithms and deterministic algorithms. The work by Kulis &
Jordan (2011) and Jiang et al. (2012) shows that by properly scaling the distributions of
the components, one can derive an algorithm that is very similar to K-means but without
requiring knowledge of the number of clusters, $k$. Instead, it requires another parameter
$\lambda$, but DP-means is much less sensitive to this parameter than K-means is to $k$. We use a
similar technique to derive our proposed algorithm, RDP-means.


3. A Nonparametric Bayesian Model 


In this section, we introduce our probabilistic model based on multi-view learning (Blum &
Mitchell, 1998). In multi-view learning, the data consists of multiple views (independent
sources of information). In our approach we consider the following two views:


1. A set of observations $\{x_i \in \mathbb{R}^p\}_{i=1}^{n}$.


2. The side information, between pairs of points, indicating how likely or unlikely two
points are to appear in the same cluster. The side information is represented by a symmetric
$n \times n$ matrix $E$: if a priori $x_i$ and $x_j$ are believed to belong to the same cluster,
then $E_{ij} = 1$. If they are believed to be in different clusters, $E_{ij} = 0$. Otherwise, if
there is no side information about the pair $(i,j)$, we denote it with $E_{ij} = \text{NULL}$. For
future reference, denote the set of side information as $C = \{(i,j) : E_{ij} \neq \text{NULL}\}$.


1. These models are further studied in Covoes et al. (2013).


Daniel Khashabi, John Wieting, Jeffrey Yufei Liu, Feng Liang 


We refer to our data, $x_{1:n}$ and $E$, as two different views of the underlying clustering
structure. It is worth noting that either view is sufficient for clustering with existing
algorithms. Given only the data instances $x_i$'s, it is the familiar clustering task where many
methods such as K-means, model-based clustering (Fraley & Raftery, 2002) and DPM can
be applied. Given the side information $E$, many graph-based clustering algorithms, such as
normalized graph-cut (Shi & Malik, 2000) and spectral clustering (Ng et al., 2002) can be
applied.

Our approach tries to aggregate information from the two views through a Bayesian
framework and reach a consensus about the cluster structure. Given the latent clustering
structure, data from the two views is modeled independently by two generative models:
$x_{1:n}$ is modeled by a Dirichlet Process Mixture (DPM) model (Antoniak, 1974; Ferguson,
1973) and $E$ is modeled by a random graph (Erdős & Rényi, 1959).


Aggregating the two views of $x_{1:n}$ and $E$ is particularly useful when neither view can be
fully trusted. While previous work such as constrained K-means or constrained EM assume
and rely on constraint exactness, TVClust uses $E$ in a "soft" manner and is more robust
to errors. We can call $E_{ij} = 1$ a may link and $E_{ij} = 0$ a may-not link, in contrast with the
aforementioned must-link and cannot-link, to emphasize that our model tolerates noise in
the side information.


3.1 Model for Data Instances 

We use the Mixture of Dirichlet Processes as the underlying clustering model for the data
instances $\{x_i\}_{i=1}^{n}$. Let $\theta_i$ denote the model parameter associated with observation $x_i$, which
is modeled as an iid sample from a random distribution $G$. A Dirichlet Process $DP(\alpha, G_0)$
is used as the prior for $G$:

$$\theta_1, \ldots, \theta_n \mid G \sim G, \qquad G \sim DP(\alpha, G_0). \tag{1}$$


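Denote the collection $(\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_n)$ by $\theta_{\setminus i}$. With prior specification (1), the
distribution of $\theta_i$ given $\theta_{\setminus i}$ (after integrating out $G$) can be found following the
Blackwell-MacQueen urn scheme (Blackwell & MacQueen, 1973):

$$p(\theta_i \mid \theta_{\setminus i}) \propto \sum_{k=1}^{K} n_{-i,k}\, \delta_{\theta_k^*}(\theta_i) + \alpha\, G_0(\theta_i), \tag{2}$$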

where we assume there are $K$ unique values among $\theta_{\setminus i}$, denoted by $\theta_1^*, \ldots, \theta_K^*$, $\delta_{\theta_k^*}(\cdot)$ is the
Kronecker delta function, and $n_{-i,k}$ is the number of instances accumulated in cluster $k$
excluding instance $i$. From (2) we can see a natural clustering effect in the sense that with
a positive probability, $\theta_i$ will take an existing value from $\theta_1^*, \ldots, \theta_K^*$, i.e. it will join one of
the $K$ clusters. This effect can be interpreted using the Chinese Restaurant Process (CRP)
metaphor (Aldous, 1983), where assigning $\theta_i$ to a cluster is analogous to a new customer
choosing a table in a Chinese restaurant. The customer can join an already occupied table
or start a new one.
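
The urn scheme (2) can be illustrated with a small sketch. The function name `crp_prior_weights` is hypothetical and not part of the paper; it only normalizes the prior weights $n_{-i,k}$ and $\alpha$, ignoring the data likelihood and the side information.

```python
def crp_prior_weights(cluster_sizes, alpha):
    """Prior probabilities of joining each existing cluster or opening a new one.

    cluster_sizes: list of n_{-i,k}, the sizes of the K existing clusters
                   (the item being assigned is excluded).
    alpha:         DP concentration parameter.
    Returns K + 1 probabilities; the last entry is the "new table" probability.
    """
    total = sum(cluster_sizes) + alpha
    weights = [size / total for size in cluster_sizes]
    weights.append(alpha / total)
    return weights


if __name__ == "__main__":
    # Three occupied tables with 5, 3 and 2 customers, alpha = 1.
    print(crp_prior_weights([5, 3, 2], alpha=1.0))
```

Larger tables attract new customers with higher probability, and a larger $\alpha$ makes opening a new table more likely.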

Given $\theta_i$, we use a parametric family $p(x_i \mid \theta_i)$ to model the instance $x_i$. In this paper,
we focus on exponential families:

$$p(x \mid \theta) = \exp\big(\langle T(x), \theta\rangle - \psi(\theta) - h(x)\big), \tag{3}$$
where $\psi(\theta) = \log \int \exp\big(\langle T(x), \theta\rangle - h(x)\big)\,dx$ is the log-partition function (cumulant
generating function) and $T(x)$ is the vector of sufficient statistics, given input point $x$. To
simplify the exposition, we assume that $x$ is the augmented vector of sufficient statistics
given an input point, and simplify (3) by removing $T(\cdot)$:

$$p(x \mid \theta) = \exp\big(\langle x, \theta\rangle - \psi(\theta) - h(x)\big). \tag{4}$$

It is easy to show that for this formulation,

$$E_p[x] = \nabla_\theta \psi(\theta), \tag{5}$$

$$\mathrm{Cov}_p[x] = \nabla_\theta^2 \psi(\theta). \tag{6}$$
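
Identities (5) and (6) can be checked numerically on a one-dimensional example. The sketch below uses the Poisson family, chosen here purely for illustration (it is not an example from the paper): in natural form $\psi(\theta) = e^{\theta}$ and $h(x) = \log x!$. All function names are hypothetical.

```python
import math


def log_partition(theta):
    # Poisson family in natural form: psi(theta) = exp(theta), theta = log(rate).
    return math.exp(theta)


def moments_by_summation(theta, max_x=200):
    # Mean and variance computed directly from
    # p(x | theta) = exp(x * theta - psi(theta) - log(x!)), truncated at max_x.
    probs = [math.exp(x * theta - log_partition(theta) - math.lgamma(x + 1))
             for x in range(max_x)]
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return mean, var


def derivatives_of_psi(theta, eps=1e-4):
    # Central finite differences for psi'(theta) and psi''(theta).
    first = (log_partition(theta + eps) - log_partition(theta - eps)) / (2 * eps)
    second = (log_partition(theta + eps) - 2 * log_partition(theta)
              + log_partition(theta - eps)) / eps ** 2
    return first, second


if __name__ == "__main__":
    theta = math.log(3.0)
    print(moments_by_summation(theta))   # moments of the distribution
    print(derivatives_of_psi(theta))     # derivatives of the log-partition
```

Both pairs of numbers should agree up to the finite-difference error, which is the content of (5) and (6) in one dimension.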

For convenience, we choose the base measure $G_0$ in $DP(\alpha, G_0)$ from the conjugate family,
which takes the following form:

$$dG_0(\theta \mid \tau, \eta) = \exp\big(\langle \theta, \tau\rangle - \eta\,\psi(\theta) - m(\tau, \eta)\big), \tag{7}$$

where $\tau$ and $\eta$ are parameters of the prior distribution. Given these definitions of the
likelihood and conjugate prior, the posterior distribution over $\theta$ is an exponential family
distribution of the same form as the prior distribution, but with updated parameters $\tau + x$
and $\eta + 1$.

Exponential families contain many popular distributions used in practice. For example,
Gaussian families are often used to model real valued points in $\mathbb{R}^p$, which correspond to
$T(x) = [x, xx^\top]^\top$ and $\theta = (\mu, \Sigma)$, where $\mu$ is the mean vector and $\Sigma$ is the covariance
matrix. The base measure often chosen for Gaussian families is its conjugate prior, the
Normal-Inverse-Wishart distribution. Another popular parametric family, the multinomial
distribution, is often used to model word counts in text mining or histograms in image
segmentation. This distribution corresponds to $T(x) = x$ and the base measure is often
chosen to be a Dirichlet distribution, its conjugate prior.

3.2 Model for Side Information 

Given $\theta_{1:n} = (\theta_1, \ldots, \theta_n)$, we can summarize the clustering structure by a matrix $H_{n \times n}$
where $H_{ij} = \delta_{\theta_i}(\theta_j)$. Note that $H$ should not be confused with $E$. $E$ represents the side
information and can be viewed as a random realization based on the true clustering structure
$H$. We want to infer $H$ based on $E$ and $x_{1:n}$.

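We model $E$ using the following generative process: with probability $p$ an existing edge
of $H$ is preserved in $E$, and with probability $q$ an absent edge of $H$ stays absent in $E$ (so a
false edge is added to $E$ with probability $1 - q$), i.e. for any $(i,j) \in C$:

$$\begin{aligned}
p(E_{ij} = 1 \mid H_{ij} = 1) &= p, \\
p(E_{ij} = 0 \mid H_{ij} = 1) &= 1 - p, \\
p(E_{ij} = 0 \mid H_{ij} = 0) &= q, \\
p(E_{ij} = 1 \mid H_{ij} = 0) &= 1 - q,
\end{aligned}$$

or more concisely,

$$p(E_{ij} \mid H_{ij}, p, q) = p^{E_{ij} H_{ij}}\, (1-p)^{(1-E_{ij}) H_{ij}}\, q^{(1-E_{ij})(1-H_{ij})}\, (1-q)^{E_{ij}(1-H_{ij})}. \tag{8}$$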
Figure 1: Graphical representation of TVClust. The data generating process for the data
instances is on the left and the process for the side information is on the right. 


The values $p$ and $q$ represent the credibility of the values in the matrix $E$, while the values
$1 - p$ and $1 - q$ are error probabilities. One may be able to set the value for $(p, q)$ based
on expert knowledge or learn them from the data in a fully Bayesian approach by adding
another layer of priors over $p$ and $q$,

$$p \sim \mathrm{Beta}(\alpha_p, \beta_p), \qquad q \sim \mathrm{Beta}(\alpha_q, \beta_q).$$
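
The generative process for the side information can be simulated directly. The following is a minimal sketch, assuming the true clustering is given as a list of labels; the function name `sample_side_information` and the `rate` argument (the fraction of pairs that receive any side information) are illustrative assumptions, and `None` plays the role of NULL.

```python
import random


def sample_side_information(labels, p, q, rate, seed=0):
    """Draw a noisy constraint matrix E from true cluster labels.

    E[i][j] is 1 (may link), 0 (may-not link) or None (no side information).
    """
    rng = random.Random(seed)
    n = len(labels)
    E = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() >= rate:
                continue  # pair (i, j) is left out of the constraint set C
            same_cluster = labels[i] == labels[j]  # this is H_ij
            if same_cluster:
                value = 1 if rng.random() < p else 0  # p(E_ij = 1 | H_ij = 1) = p
            else:
                value = 0 if rng.random() < q else 1  # p(E_ij = 0 | H_ij = 0) = q
            E[i][j] = E[j][i] = value  # E is symmetric
    return E


if __name__ == "__main__":
    for row in sample_side_information([0, 0, 1, 1, 1], p=0.9, q=0.9, rate=0.6):
        print(row)
```

With `p = q = 1` and `rate = 1` the matrix reproduces $H$ exactly off the diagonal; lowering `p` or `q` injects the two kinds of error described above.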


3.3 Posterior Inference via Gibbs Sampling 

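A graphical representation of our model TVClust is shown in Figure 1. Based on the
parameters $\theta_1, \ldots, \theta_n$, the full data likelihood is

$$p(x_{1:n}, E \mid \theta_{1:n}) = \prod_{i=1}^{n} p(x_i \mid \theta_i) \prod_{1 \le i < j \le n} p(E_{ij} \mid \theta_{1:n}, p, q),$$

where $p(E_{ij} = \text{NULL} \mid \theta_{1:n}, p, q) = 1$, i.e. no side information is provided for pair $(i,j)$.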

A Gibbs sampling scheme can be derived for our TVClust model, which is similar in spirit
to other Gibbs samplers for DPM (see the comprehensive review in Neal (2000)).
The Gibbs sampler involves iteratively sampling from the full conditional distribution of
each unknown parameter given other parameters and the data. The key step is the sampling
of $p(\theta_i \mid \theta_{\setminus i}, x_{1:n}, E, p, q)$. Using the independence between variables (see the graphical model
in Figure 1), we have

$$p(\theta_i \mid \theta_{\setminus i}, x_{1:n}, E, p, q) \propto p(x_i \mid \theta_i)\, p(E_i \mid \theta_i, \theta_{\setminus i}, p, q)\, p(\theta_i \mid \theta_{\setminus i}), \tag{9}$$


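where we use $E_i = \{E_{ij} : j \neq i\}$ to denote the set of side information related to data instance
$i$. Following the Blackwell-MacQueen urn representation of the prior (2), we have

$$\begin{aligned}
p(\theta_i \mid \theta_{\setminus i}, x_{1:n}, E, p, q) \propto{} & \sum_{k=1}^{K} n_{-i,k}\, p(x_i \mid \theta_k^*)\, p(E_i \mid \theta_k^*, \theta_{\setminus i}, p, q)\, \delta_{\theta_k^*}(\theta_i) \\
& + \alpha\, p(x_i \mid \theta_i)\, p(E_i \mid \theta_i, \theta_{\setminus i}, p, q)\, G_0(\theta_i).
\end{aligned} \tag{10}$$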
The full conditional of $\theta_i$ given others is a mixture of a discrete distribution with point
masses located at $\theta_1^*, \ldots, \theta_K^*$ and a continuous component. Sampling from the discrete
component only involves evaluation of the likelihood function of $x_i$ and $E_i$, which can
be easily computed. Now we focus on sampling $\theta_i$ from the continuous component. First
observe that when $\theta_i$ is sampled from this continuous component, we have $H_{ij} = \delta_{\theta_i}(\theta_j) = 0$
for all $j \neq i$, therefore:

$$p(E_i \mid \theta_i, \theta_{\setminus i}, p, q) = \prod_{j : (i,j) \in C} q^{\,1 - E_{ij}}\, (1 - q)^{E_{ij}},$$

which does not depend on the actual value of $\theta_i$. Also note that

$$p(x_i \mid \theta_i)\, G_0(\theta_i) = p_{G_0}(\theta_i \mid x_i)\, p_{G_0}(x_i),$$

where the subscript $G_0$ is used to emphasize that the posterior and the marginal distributions,
$p_{G_0}(\theta_i \mid x_i)$ and $p_{G_0}(x_i)$, are calculated with respect to the prior $G_0$. Then we can rewrite
the sampling distribution for the continuous component as

$$\alpha\, p(x_i \mid \theta_i)\, p(E_i \mid \theta_i, \theta_{\setminus i}, p, q)\, G_0(\theta_i) \propto \alpha\, p_{G_0}(x_i) \Big[\prod_{j : (i,j) \in C} q^{\,1 - E_{ij}} (1 - q)^{E_{ij}}\Big]\, p_{G_0}(\theta_i \mid x_i).$$


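Finally we can simplify the sampling distribution (10) as

$$\begin{aligned}
p(\theta_i \mid \theta_{\setminus i}, x_{1:n}, E, p, q) \propto{} & \sum_{k=1}^{K} n_{-i,k}\, p(x_i \mid \theta_k^*)\, \delta_{\theta_k^*}(\theta_i) \left(\frac{p}{1-q}\right)^{f_i^k} \left(\frac{1-p}{q}\right)^{s_i^k} \\
& + \alpha\, p_{G_0}(x_i)\, p_{G_0}(\theta_i \mid x_i),
\end{aligned} \tag{11}$$

where

$$f_i^k = \#\{j : \theta_j = \theta_k^*,\ E_{ij} = 1\}, \qquad s_i^k = \#\{j : \theta_j = \theta_k^*,\ E_{ij} = 0\}. \tag{12}$$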


Using the analogy of the Chinese Restaurant interpretation of DPM (Aldous, 1983), we
can interpret the sampling distribution of $\theta_i$ in the following way. Let instance $i$ be a friend
of instance $j$ if $E_{ij} = 1$. Similarly, two instances are strangers if $E_{ij} = 0$. So $f_i^k$ is the
number of friends of instance $i$ at table $k$, and $s_i^k$ is the number of strangers for $i$ at table
$k$.

As mentioned before, the values $p$ and $q$ represent the credibility of side information.
For a reasonable confidence over constraints, usually $p > 1 - q$. Then by (11), the chance
of a person being assigned to a table not only increases with the popularity of the table (i.e. the
table size $n_{-i,k}$) like in the original DPM, but also increases with their friend count $f_i^k$ and
decreases with their stranger count $s_i^k$.
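
The friend and stranger counts of (12) and the resulting unnormalized weights can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, `likelihoods[k]` stands for $p(x_i \mid \theta_k^*)$ and `new_cluster_marginal` for $p_{G_0}(x_i)$, both assumed to be computed elsewhere, and $0 < q < 1$ is assumed.

```python
def friend_stranger_counts(i, z, E, k):
    """f_i^k and s_i^k: may / may-not links between i and the members of cluster k."""
    friends = strangers = 0
    for j, zj in enumerate(z):
        if j == i or zj != k:
            continue
        if E[i][j] == 1:
            friends += 1
        elif E[i][j] == 0:
            strangers += 1
    return friends, strangers


def assignment_weights(i, z, E, likelihoods, new_cluster_marginal, alpha, p, q):
    """Unnormalized weights for clusters 0..K-1, followed by the new-cluster weight."""
    weights = []
    for k in range(len(likelihoods)):
        n_k = sum(1 for j, zj in enumerate(z) if j != i and zj == k)  # n_{-i,k}
        f, s = friend_stranger_counts(i, z, E, k)
        weights.append(n_k * likelihoods[k]
                       * (p / (1 - q)) ** f * ((1 - p) / q) ** s)
    weights.append(alpha * new_cluster_marginal)
    return weights


if __name__ == "__main__":
    z = [0, 0, 1, 1]
    E = [[None, 1, 0, None],
         [1, None, None, None],
         [0, None, None, None],
         [None, None, None, None]]
    print(assignment_weights(0, z, E, [0.5, 0.5], 0.1, alpha=1.0, p=0.9, q=0.8))
```

In the example, instance 0 has one friend in cluster 0 and one stranger in cluster 1, so cluster 0 is favored even though cluster 1 is larger.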

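Instead of sequentially updating the point-specific parameters $(\theta_1, \ldots, \theta_n)$, one can
sequentially update an equivalent parameter set: the set of cluster-specific parameters
$(\theta_1^*, \ldots, \theta_K^*)$ and the cluster assignment indicators $(z_1, \ldots, z_n)$, where $z_i \in \{1, \ldots, K\}$
indicates the cluster assignment for instance $i$, i.e., $\theta_i = \theta_{z_i}^*$. By our derivation at (11), we can
update the $z_i$'s sequentially as

$$\begin{aligned}
p(z_i = k) &\propto n_{-i,k}\, p(x_i \mid \theta_k^*) \left(\frac{p}{1-q}\right)^{f_i^k} \left(\frac{1-p}{q}\right)^{s_i^k}, \\
p(z_i = k_{\text{new}}) &\propto \alpha \int p(x_i \mid \theta)\, dG_0(\theta).
\end{aligned} \tag{13}$$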




The cluster parameters $(\theta_1^*, \ldots, \theta_K^*)$, given the partition $z_{1:n}$ and the data $x_{1:n}$, can be
updated similarly as they were in Algorithm 2 of Neal (2000).


4. RDP-means: A Deterministic Algorithm 


In this section, we apply the scaling trick as described in Jiang et al. (2012) to transform
the Gibbs sampler to a deterministic algorithm, which we refer to as RDP-means.


4.1 Reparameterization of the exponential family using Bregman divergence

We bring in the notion of Bregman divergence and its connection to the exponential family. 
Our starting point is the formal definition of the Bregman divergence. 


Definition 1 (Bregman, 1967) Define a strictly convex function $\phi : S \to \mathbb{R}$, such that
the domain $S \subseteq \mathbb{R}^p$ is a convex set, and $\phi$ is differentiable on $\mathrm{ri}(S)$, the relative interior
of $S$, where its gradient $\nabla\phi$ exists. Given two points $x, y \in \mathbb{R}^p$, the Bregman divergence
$D_\phi(x, y) : S \times \mathrm{ri}(S) \to [0, +\infty)$ is defined as:

$$D_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y)\rangle.$$

The Bregman divergence is a general class of distance measures. For instance, with the
squared Euclidean norm $\phi(x) = \|x\|^2$, the Bregman divergence is the squared Euclidean
distance (see Table 1 in Banerjee et al. (2005) for other cases).

Forster & Warmuth (2002) showed that there exists a bijection between exponential
families and Bregman divergences. Given this connection, Banerjee et al. (2005) derived
a K-means type algorithm for fitting a probabilistic mixture model (with fixed number of
components) using Bregman divergence, rather than the Euclidean distance.
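
Definition 1 translates directly into code. The sketch below is illustrative only (the function names are hypothetical); it evaluates the divergence for two standard generators: the squared norm, which gives the squared Euclidean distance, and the negative entropy, which gives the KL divergence between probability vectors.

```python
import math


def bregman_divergence(phi, grad_phi, x, y):
    """D_phi(x, y) = phi(x) - phi(y) - <x - y, grad phi(y)> for vectors given as lists."""
    inner = sum((xi - yi) * gi for xi, yi, gi in zip(x, y, grad_phi(y)))
    return phi(x) - phi(y) - inner


# phi(x) = ||x||^2 yields the squared Euclidean distance.
def sq_norm(x):
    return sum(v * v for v in x)


def sq_norm_grad(x):
    return [2.0 * v for v in x]


# phi(x) = sum_i x_i log x_i (negative entropy) yields the KL divergence
# when x and y are probability vectors with strictly positive entries.
def neg_entropy(x):
    return sum(v * math.log(v) for v in x)


def neg_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


if __name__ == "__main__":
    print(bregman_divergence(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 0.0]))
    print(bregman_divergence(neg_entropy, neg_entropy_grad, [0.5, 0.5], [0.25, 0.75]))
```

Swapping the generator changes the geometry of the clustering without changing the algorithm, which is what makes the Bregman formulation convenient here.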

Definition 2 (Legendre Conjugate) For a function $\psi(\cdot)$ defined over $\mathbb{R}^p$, define its convex
conjugate $\psi^*(\cdot)$ as $\psi^*(\mu) = \sup_{\theta \in \mathrm{dom}(\psi)} \{\langle \mu, \theta\rangle - \psi(\theta)\}$. In addition, if the function
$\psi(\theta)$ is closed and convex, $(\psi^*)^* = \psi$.

It can be shown that the log-partition function of the exponential families of distributions
is a closed convex function (see Lemma 1 of Banerjee et al. (2005)). Therefore
there is a bijection between the conjugate parameter $\mu$ of the Legendre conjugate $\psi^*(\cdot)$,
and the parameter of the exponential family, $\theta$, in the log-partition function defined for the
exponential family at (4). With this bijection, we can rewrite the likelihood (4) using the
Bregman divergence and the Legendre conjugate:

$$p(x \mid \theta) = p(x \mid \mu) = \exp\big(-D_{\psi^*}(x, \mu)\big)\, f_{\psi^*}(x), \tag{14}$$


where $f_{\psi^*}(x) = \exp\big(\psi^*(x) - h(x)\big)$. The left side of (14) is written as $p(x \mid \theta) = p(x \mid \mu)$ to
stress that conditioning on $\theta$ is equivalent to conditioning on $\mu$, since there is a bijection
between them, and the right side of (14) is essentially the same as (4). A nice intuition
about this reparameterization is that now the likelihood of any data point $x$ is related to
how far it is from the cluster component parameters $\mu$, where the distance is measured
using the Bregman divergence $D_{\psi^*}(x, \mu)$.
Similarly we can rewrite the prior (7) in terms of the Bregman divergence and the
Legendre conjugate:

$$p(\theta \mid \tau, \eta) = p(\mu \mid \tau, \eta) = \exp\big(-\eta\, D_{\psi^*}(\tau/\eta, \mu)\big)\, g_{\psi^*}(\tau, \eta), \tag{15}$$

where $g_{\psi^*}(\tau, \eta) = \exp\big(\eta\,\psi^*(\tau/\eta) - m(\tau, \eta)\big)$.


4.2 Scaling the Distributions 

Lemma 3 (Jiang et al. (2012)) Given the exponential family distribution (4), define
another probability distribution with parameter $\tilde\theta$ and log-partition function $\tilde\psi(\cdot)$, where
$\tilde\theta = \gamma\theta$ and $\tilde\psi(\tilde\theta) = \gamma\,\psi(\tilde\theta/\gamma)$, then:

1. The scaled probability distribution $\tilde p(\cdot)$ defined with parameter vector $\tilde\theta$ and log-partition
function $\tilde\psi(\cdot)$ is a proper probability distribution and belongs to the exponential family.

2. The mean and variance of the probability distribution $\tilde p(\cdot)$ are:

$$E_{\tilde p}(x) = E_p(x), \qquad \mathrm{Cov}_{\tilde p}(x) = \frac{1}{\gamma}\,\mathrm{Cov}_p(x).$$

3. The Legendre conjugate of $\tilde\psi(\cdot)$² is:

$$\tilde\psi^*(\mu) = \gamma\,\psi^*(\mu).$$


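The implication of Lemma 3 is that the covariance $\mathrm{Cov}_{\tilde p}(x)$ scales with $1/\gamma$, which is
close to zero when $\gamma$ is large, but the mean $E_{\tilde p}(x)$ remains the same. Thus we can obtain a
deterministic algorithm when $\gamma$ goes to infinity.

With the scaling trick, the scaled prior and scaled likelihood can be written as:

$$\begin{aligned}
\tilde p(x \mid \tilde\theta, \gamma) &= \tilde p(x \mid \mu, \gamma) = \exp\big(-\gamma\, D_{\psi^*}(x, \mu)\big)\, f_{\tilde\psi^*}(x), \\
\tilde p(\tilde\theta \mid \tau, \eta, \gamma) &= \tilde p(\mu \mid \tau, \eta, \gamma) = \exp\big(-\eta\, D_{\psi^*}(\tau/\eta, \mu)\big)\, g_{\tilde\psi^*}(\tau/\gamma, \eta/\gamma).
\end{aligned} \tag{16}$$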


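4.3 Asymptotics of TVClust


Using the scaled distributions (16) we can write the Gibbs update (13) in the following
form:

$$\begin{aligned}
p(z_i = k) &\propto n_{-i,k}\, \exp\big(-\gamma\, D_{\psi^*}(x_i, \mu_k)\big) \left(\frac{p}{1-q}\right)^{f_i^k} \left(\frac{1-p}{q}\right)^{s_i^k}, \\
p(z_i = k_{\text{new}}) &\propto \alpha \int \tilde p(x_i \mid \tilde\theta)\, \tilde p(\tilde\theta \mid \tau, \eta)\, d\tilde\theta.
\end{aligned} \tag{17}$$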


Following Jiang et al. (2012), we can approximate the integral $I = \int \tilde p(x \mid \tilde\theta)\, \tilde p(\tilde\theta \mid \tau, \eta)\, d\tilde\theta$ using
the Laplace approximation (Tierney & Kadane, 1986):


p{x\T,g,g) Ri5..y^*(T/7, 77 / 7 ) exp ( -gfiir/g) - (g + g)4>i ^ ^ ) ) 7 Cov 


7X + T 

7 + r 


2. The conjugate of $\tilde\psi(\cdot)$ is denoted by $\tilde\psi^*$ for simplicity.
We can write the resulting expression as a product of a function of the parameters and a
function of the input observations:

$$\tilde p(x \mid \tau, \eta, \gamma) \approx \kappa(\tau, \eta, \gamma) \times \nu(x; \tau, \eta, \gamma).$$

The concentration parameter of the DPM, $\alpha$ in (1), is usually tuned by the user. To get the
desired result, we choose it to be:

$$\alpha = \kappa(\tau, \eta, \gamma)^{-1} \exp(-\lambda\gamma),$$

where $\lambda$ is a new parameter introduced for the model. In other words, the effect of the
other parameters $(\alpha, \tau, \eta)$ is now transferred to $\lambda$. Then the 2nd line of (17) becomes

$$p(z_i = k_{\text{new}}) = \frac{1}{Z}\, \frac{\nu(x_i; \tau, \eta, \gamma)}{n + \alpha - 1}\, \exp(-\gamma\lambda), \tag{18}$$


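such that $\nu(x_i; \tau, \eta, \gamma)$ becomes a positive constant when $\gamma$ goes to infinity. Applying a
similar trick to the 1st line of (17), we have

$$\left(\frac{p}{1-q}\right)^{f_i^k} \left(\frac{1-p}{q}\right)^{s_i^k} = \exp\left\{ f_i^k \ln\frac{p}{1-q} - s_i^k \ln\frac{q}{1-p} \right\} = \exp\big\{\gamma\,(f_i^k \xi_1 - s_i^k \xi_2)\big\},$$

where we introduced new variables $\xi_1 = \frac{1}{\gamma}\ln\frac{p}{1-q}$ and $\xi_2 = \frac{1}{\gamma}\ln\frac{q}{1-p}$, which represent the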

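confidence on having a link, and not having a link, respectively. Then the 1st line of (17)
becomes:

$$p(z_i = k) = \frac{1}{Z}\, \frac{n_{-i,k}}{n + \alpha - 1}\, \exp\big\{-\gamma\,\big(D_{\psi^*}(x_i, \mu_k) - \xi_1 f_i^k + \xi_2 s_i^k\big)\big\}. \tag{19}$$

Combining (18) and (19), we can rewrite the Gibbs updates (13) as follows:

$$\begin{aligned}
p(z_i = k) &\propto n_{-i,k}\, \exp\big\{-\gamma\,\big(D_{\psi^*}(x_i, \mu_k) - \xi_1 f_i^k + \xi_2 s_i^k\big)\big\}, \\
p(z_i = k_{\text{new}}) &\propto \nu(x_i; \tau, \eta, \gamma)\, \exp(-\gamma\lambda).
\end{aligned}$$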


When $\gamma$ goes to infinity, the Gibbs sampler degenerates into a deterministic algorithm,
where in each iteration, the assignment of $x_i$ is determined by comparing the $K + 1$ values
below:

$$\big\{ D_{\psi^*}(x_i, \mu_1) - \xi_1 f_i^1 + \xi_2 s_i^1,\ \ldots,\ D_{\psi^*}(x_i, \mu_K) - \xi_1 f_i^K + \xi_2 s_i^K,\ \lambda \big\}.$$

If the $k$-th value (where $k = 1, \ldots, K$) is the smallest, then assign $x_i$ to the $k$-th cluster. If
$\lambda$ is the smallest, form a new cluster.
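
The comparison of the $K + 1$ values can be sketched as a small function. This is a hedged illustration with hypothetical names: the divergences and the friend and stranger counts are assumed to be precomputed, and ties with $\lambda$ are resolved in favor of a new cluster, following the `d_min < λ` test of Algorithm 1.

```python
def assign_point(divergences, friends, strangers, xi1, xi2, lam):
    """Return the index of the chosen existing cluster, or K to signal a new cluster.

    divergences[k] = D(x_i, mu_k); friends[k] = f_i^k; strangers[k] = s_i^k.
    """
    costs = [d - xi1 * f + xi2 * s
             for d, f, s in zip(divergences, friends, strangers)]
    best = min(range(len(costs)), key=costs.__getitem__)
    if costs[best] < lam:
        return best
    return len(costs)  # lambda is the smallest value: open a new cluster


if __name__ == "__main__":
    # With side information the farther cluster wins because of three friends.
    print(assign_point([4.0, 1.0], [3, 0], [0, 2], xi1=1.0, xi2=1.0, lam=2.5))
    # Without side information (xi = 0) the nearest cluster wins.
    print(assign_point([4.0, 1.0], [3, 0], [0, 2], xi1=0.0, xi2=0.0, lam=2.5))
```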
4.4 Sampling the cluster parameters

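Given the cluster assignments $z_{1:n}$, the cluster centers are independent of the side
information. In other words, the posterior distribution over the cluster parameters can be
written in the following form:

$$\begin{aligned}
p(\mu_k \mid x_{1:n}, z_{1:n}, \tau, \eta, \gamma, \xi) &\propto \Big(\prod_{i : z_i = k} \tilde p(x_i \mid \mu_k, \gamma)\Big)\, \tilde p(\mu_k \mid \tau, \eta, \gamma) \\
&\propto \exp\left( -(\gamma n_k + \eta)\, D_{\psi^*}\!\left( \frac{\gamma \sum_{i : z_i = k} x_i + \tau}{\gamma n_k + \eta},\ \mu_k \right) \right),
\end{aligned}$$

in which $n_k = \#\{i : z_i = k\}$. When $\gamma \to \infty$,

$$p(\mu_k \mid x_{1:n}, z_{1:n}, \tau, \eta, \gamma, \xi) \propto \exp\left( -(\gamma n_k + \eta)\, D_{\psi^*}\!\left( \frac{1}{n_k} \sum_{i : z_i = k} x_i,\ \mu_k \right) \right).$$

The maximum is attained when the arguments of the Bregman divergence are the same,
i.e.

$$\mu_k = \frac{1}{n_k} \sum_{i : z_i = k} x_i.$$

So cluster parameters are just updated by the corresponding cluster means. This completes
the algorithm for RDP-means, which is shown in Algorithm 1.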


4.5 Effect of changing $\xi_1$ and $\xi_2$

Taking $\xi_1, \xi_2 \to 0$, RDP-means will behave like DP-means, i.e. no side information is
considered. Taking $\xi_1, \xi_2 \to +\infty$ puts all the weight on the side information and no weight
on the point observations. In other words, it generates a set of clusters according to just
the constraints in $E$. In a similar way, we can put more weight on may links compared to
may-not links by choosing $\xi_1 > \xi_2$ and vice versa.


As we will show, there is an objective function which corresponds to our algorithm. The
objective function has many local minima and the algorithm minimizes it in a greedy
fashion. Experimentally we have observed that if we initialize $\xi_1 = \xi_2 = \xi$ with a very small
value $\xi_0$ and increase it each iteration, incrementally tightening the constraints, it gives a
desirable result.


4.6 Objective Function 

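Theorem 4 The constrained clustering RDP-means (Algorithm 1) iteratively minimizes
the following objective function:

$$\min_{\{\mathcal{I}_k\}_{k=1}^{K}}\ \sum_{k=1}^{K} \sum_{i \in \mathcal{I}_k} \big[ D_{\psi^*}(x_i, \mu_k) - \xi_1 f_i^k + \xi_2 s_i^k \big] + \lambda K, \tag{20}$$

where $\mathcal{I}_1, \ldots, \mathcal{I}_K$ denote a partition of the $n$ data instances.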
Algorithm 1: Relational DP-means algorithm

Input: The data points D = {x_i}, the relational matrix E, the parameter of the Bregman divergence ψ*, the
parameters λ, ξ_0, and its rate of increase at each iteration ξ_rate.

Result: The assignment variables z = [z_1, z_2, ..., z_n] and the component parameters.

Initialization: ξ ← ξ_0 (with ξ_1 = ξ_2 = ξ), and all points are assigned to one single cluster.

while not converged do
    for x_i ∈ D do
        for μ_k ∈ C do
            Find the values of f_i^k and s_i^k for x_i from the matrix E, using the current z, as defined in (12);
            dist(x_i, μ_k) ← D_ψ*(x_i, μ_k) − ξ_1 f_i^k + ξ_2 s_i^k;
        end
        [d_min, z_min] ← min{dist(x_i, μ_1), ..., dist(x_i, μ_K)};
        // d_min is the minimum distance and z_min is the index of the minimum distance.
        if d_min < λ then
            z_i ← z_min
        else
            // Add a new cluster:
            C ← {C ∪ x_i}
            K ← K + 1
        end
    end
    for μ_k ∈ C do
        // Given the current assignment of points, find the set D_k of points assigned to cluster k.
        if |D_k| > 0 then
            μ_k ← (1 / |D_k|) Σ_{x_i ∈ D_k} x_i
        else
            // Remove the cluster and apply the changes to the related variables;
        end
    end
    ξ ← ξ × ξ_rate
end


Proof In the proof we follow a similar argument as in Kulis & Jordan (2011). For
simplicity, let us assume $\xi_1 = \xi_2 = \xi$ and call the value $D_{\psi^*}(x_i, \mu_k) - \xi(f_i^k - s_i^k)$ the augmented
distance. For a fixed number of clusters, each point gets assigned to the cluster that has
the smallest augmented distance, thus decreasing the value of the objective function. When
the augmented distance value of an element, $D_{\psi^*}(x_i, \mu_k) - \xi(f_i^k - s_i^k)$, is more than $\lambda$, we
remove the point from its existing cluster, add a new cluster centered at the data point
and increase the value of $K$ by one. This increases the objective by $\lambda$ while removing an
augmented distance larger than $\lambda$ (an overall decrease in the objective function). For a fixed
assignment of the points to the clusters, finding the cluster centers by averaging the assigned
points minimizes the objective function. Thus, the objective function does not increase
after any iteration. ■
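
Algorithm 1 and the objective (20) can be sketched together in code. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the squared Euclidean distance as the Bregman divergence, sets $\xi_1 = \xi_2 = \xi$, encodes missing side information as `None`, and stops at the first pass with unchanged assignments (the experiments use a longer patience). The names `rdp_means`, `objective`, `counts` and `sq_dist` are hypothetical.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def counts(i, k, z, E):
    # f_i^k and s_i^k: may / may-not links between i and current members of cluster k.
    f = sum(1 for j in range(len(z)) if j != i and z[j] == k and E[i][j] == 1)
    s = sum(1 for j in range(len(z)) if j != i and z[j] == k and E[i][j] == 0)
    return f, s


def rdp_means(X, E, lam, xi0=0.001, xi_rate=2.0, max_iter=50):
    n = len(X)
    z = [0] * n                                       # one single initial cluster
    centers = [[sum(col) / n for col in zip(*X)]]
    xi = xi0
    for _ in range(max_iter):
        previous = list(z)
        for i in range(n):
            costs = []
            for k, mu in enumerate(centers):
                f, s = counts(i, k, z, E)
                costs.append(sq_dist(X[i], mu) - xi * f + xi * s)
            best = min(range(len(costs)), key=costs.__getitem__)
            if costs[best] < lam:
                z[i] = best
            else:                                     # open a new cluster at x_i
                centers.append(list(X[i]))
                z[i] = len(centers) - 1
        # Drop empty clusters, relabel, and recompute the means.
        used = sorted(set(z))
        relabel = {old: new for new, old in enumerate(used)}
        z = [relabel[k] for k in z]
        centers = []
        for k in range(len(used)):
            members = [X[i] for i in range(n) if z[i] == k]
            centers.append([sum(col) / len(members) for col in zip(*members)])
        if z == previous:
            break
        xi *= xi_rate                                 # tighten the constraints
    return z, centers


def objective(X, E, z, centers, lam, xi):
    # Objective (20) with xi_1 = xi_2 = xi.
    total = lam * len(centers)
    for i, k in enumerate(z):
        f, s = counts(i, k, z, E)
        total += sq_dist(X[i], centers[k]) - xi * f + xi * s
    return total


if __name__ == "__main__":
    X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
    E = [[None] * len(X) for _ in X]                  # no side information
    z, centers = rdp_means(X, E, lam=20.0)
    print(z, centers, objective(X, E, z, centers, lam=20.0, xi=0.0))
```

With an all-`None` matrix the sketch reduces to DP-means; filling in entries of `E` shifts the augmented distances as described in Section 4.5.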


4.7 Spectral Interpretation 

Following the spectral relaxation framework for the K-means objective function introduced
by Zha et al. (2001) and Kulis & Jordan (2011), we can apply the same reformulation to our
framework, given the objective function (20). Consider the following optimization problem:

$$\max_{\{Y \mid Y^\top Y = I_n\}} \operatorname{tr}\Big( Y^\top \big(K - \lambda I + \xi_1 E^{+} - \xi_2 E^{-}\big)\, Y \Big), \tag{21}$$


Clustering With Side Information: From a Probabilistic Model to a Deterministic Algorithm 


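where $Y = Z(Z^\top Z)^{-1/2}$ is the normalized point-component assignment matrix, and
$K$ is the kernel matrix which is defined as:

$$K = A^\top A \in \mathbb{R}^{n \times n}, \qquad A = [x_1, \ldots, x_n] \in \mathbb{R}^{p \times n}.$$

$E^{+} = \mathbb{1}\{E > 0\}$ and $E^{-} = \mathbb{1}\{E < 0\}$ are side information matrices for may and may-not
links, respectively (where $\mathbb{1}\{\cdot\}$ is applied elementwise).

In particular, if $\xi_1 = \xi_2 = \xi$, Equation (21) becomes:

$$\max_{\{Y \mid Y^\top Y = I_n\}} \operatorname{tr}\Big( Y^\top (K - \lambda I + \xi E)\, Y \Big).$$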


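Theorem 5 The objective function in (21) is equivalent to the objective function in (20).

Proof In Kulis & Jordan (2011) (Lemma 5.1) it has been proved that
$\max_{\{Y \mid Y^\top Y = I_n\}} \operatorname{tr}\big(Y^\top (K - \lambda I) Y\big)$ is equivalent to minimizing the objective function of
DP-means. For simplicity, we prove the case for $\xi_1 = \xi_2 = \xi$, although the general case can
also be proved in a very similar fashion. In our objective function we have the additional
term $\max_{\{Y \mid Y^\top Y = I_n\}} \operatorname{tr}\big(Y^\top (\xi E) Y\big)$, which we will prove to be
$\xi \sum_{k=1}^{K} \sum_{i \in \mathcal{I}_k} (f_i^k - s_i^k)$:

$$\begin{aligned}
\operatorname{tr}\big(Y^\top (\xi E)\, Y\big) &= \xi \operatorname{tr}\big(Y^\top E\, Y\big) \\
&\propto \xi \sum_{k=1}^{K} \sum_{i \in \mathcal{I}_k} \sum_{j \in \mathcal{I}_k} E_{ij} \\
&\propto \xi \sum_{k=1}^{K} \sum_{i \in \mathcal{I}_k} \Big( \sum_{j \in \mathcal{I}_k} \mathbb{1}\{E_{ij} = 1\} - \sum_{j \in \mathcal{I}_k} \mathbb{1}\{E_{ij} = -1\} \Big) \\
&\propto \xi \sum_{k=1}^{K} \sum_{i \in \mathcal{I}_k} (f_i^k - s_i^k).
\end{aligned}$$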


Given the objective function in Equation (21), we can use Theorem 5.2 in Kulis & Jordan
(2011) and design a spectral algorithm for solving our problem, simply by finding
eigenvectors of $K + \xi_1 E^{+} - \xi_2 E^{-}$ that have an eigenvalue larger than $\lambda$. By this interpretation
one can easily see that if $\xi_1 = \xi_2 = 0$, the objective function is equivalent to the DP-means
objective, and when $\xi_1$ and $\xi_2$ are large, the clustering only makes use of the side information.


5. Experiments 

In this section we report experiments on simulated data, a variety of UCI datasets and an
ImageNet dataset.³ For evaluation, we report the F-measure (F) exactly as defined in
Section 4.1 of Bilenko et al. (2004), adjusted Rand index (AdjRnd) and normalized mutual
information (NMI). For RDP-means, in all experiments, we terminate the algorithm when
the cluster assignments did not change after 20 iterations, and we initialize $\xi_0 = 0.001$ and
$\xi_{\text{rate}} = 2$. For DP-means and RDP-means we calculate $\lambda$ based on the k-th furthest first
method explained in Kulis & Jordan (2011). Although we use the actual $k$ in calculating
$\lambda$, in practice, $\lambda$ is less sensitive to initialization (see Figure).

3. The code and data for our experiments and implementation are available at https://goo.gl/i6yoPb.

Figure 2: Comparison of cluster quality on an example of our synthetic data (panel titles: True labels, K-means).
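
The farthest-first choice of $\lambda$ mentioned in the experimental setup can be sketched as follows. This reflects one reading of the heuristic described by Kulis & Jordan and should be treated as an assumption: start from the global mean, repeatedly pick the point farthest from the set chosen so far, and return the distance of the $k$-th pick. The squared Euclidean distance and the name `farthest_first_lambda` are illustrative choices.

```python
def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def farthest_first_lambda(X, k):
    """Heuristic lambda from a target number of clusters k (squared Euclidean)."""
    n = len(X)
    mean = [sum(col) / n for col in zip(*X)]
    # dist_to_set[i] = squared distance from X[i] to the closest chosen element.
    dist_to_set = [sq_dist(x, mean) for x in X]
    lam = 0.0
    for _ in range(k):
        far = max(range(n), key=dist_to_set.__getitem__)
        lam = dist_to_set[far]                     # distance of the current pick
        dist_to_set = [min(d, sq_dist(x, X[far]))
                       for d, x in zip(dist_to_set, X)]
    return lam


if __name__ == "__main__":
    X = [[0.0], [1.0], [10.0], [11.0]]
    print(farthest_first_lambda(X, 2), farthest_first_lambda(X, 3))
```

Asking for more clusters yields a smaller $\lambda$, which matches its role as the cost of opening a new cluster.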

We compare with all constrained or semi-supervised clustering techniques from the
literature that we could find online⁴ and from personal communications.⁵ In our experiments,
we do not include results from methods where we observed unstable behavior such as
numerical instabilities. The parameters for all algorithms were set to the default settings of
the authors' implementation.

We experiment on three tasks. The first experiment evaluates the algorithms on a set of 
two-dimensional simulated data that showcases difficult clustering problems in order to gain 
some visual intuition of how the algorithms perform. The second experiment evaluates on a 
collection of 5 datasets from the UCI repository, commonly used for evaluation of clustering 
tasks: iris, wine, ecoli, glass, and balance. We also study the effect of varying some of the 
key parameters in these experiments. The third task illustrates the effectiveness of using
side information in the image clustering task of Jiang et al. (2012).


One important variable is the percentage of side information r, which is the number 
of ±1 elements in the E matrix (Section 3.1) normalized by its size. We experimented 


4. See http://goo.gl/tSSH95 for a list of methods with links to their implementations. 

5. The code for LCVQE is kindly provided by the authors of Covoes et al. (2013) via personal communication. 

Method \ Dataset         iris                   wine                   ecoli
                         F     AdjRnd  NMI      F     AdjRnd  NMI      F     AdjRnd  NMI
K-means                  0.81  0.71    0.74     0.59  0.36    0.42     0.61  0.50    0.63
DP-means                 0.74  0.57    0.69     0.63  0.37    0.44     0.71  0.61    0.64
TVClust (variational)    0.91  0.85    0.90     0.56  0.45    0.53     0.85  0.79    0.76
RDP-means                0.86  0.80    0.80     0.81  0.73    0.72     0.90  0.86    0.82
MPCKMeans                0.53  0.29    0.30     0.54  0.30    0.30     0.57  0.33    0.33
LCVQE                    0.73  0.58    0.60     0.57  0.35    0.39     0.61  0.52    0.61

Method \ Dataset         glass                  balance                averaged over datasets
                         F     AdjRnd  NMI      F     AdjRnd  NMI      F     AdjRnd  NMI
K-means                  0.57  0.47    0.71     0.47  0.14    0.12     0.61  0.44    0.52
DP-means                 0.53  0.29    0.45     0.31  0.12    0.21     0.58  0.39    0.48
TVClust (variational)    0.42  0.22    0.42     0.94  0.92    0.91     0.74  0.64    0.70
RDP-means                0.82  0.76    0.73     0.94  0.92    0.88     0.87  0.81    0.79
MPCKMeans                0.46  0.30    0.36     0.70  0.26    0.28     0.56  0.30    0.31
LCVQE                    0.64  0.55    0.66     0.62  0.38    0.37     0.64  0.48    0.53

Table 1: Results over each UCI dataset averaging over p and r parameters. RDP-means has 
the best performance overall but not in some particular cases as shown in Tables 2 and 3. 


with varying r, adding noise with probability 1 − p to the constraint matrix, and deviating 
the initialization from the true value of k. 
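
To make the roles of r and p concrete, the following sketch generates a constraint matrix from ground-truth labels. It is an illustration under stated assumptions rather than the procedure used in the experiments: each unordered pair is observed independently with probability r, an observed entry is +1 for a same-cluster pair and −1 otherwise, and each observed entry is flipped with probability 1 − p. The function names, the use of 0 for unobserved pairs, and the normalization over off-diagonal entries are choices made for this example.

```python
import random


def sample_constraints(labels, r, p, seed=0):
    """Build a symmetric constraint matrix E with entries in {-1, 0, +1}.

    Assumed convention: 0 marks an unobserved pair, +1 a may-link and
    -1 a may-not-link constraint.
    """
    rng = random.Random(seed)
    n = len(labels)
    E = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() >= r:
                continue  # this pair is not sampled
            value = 1 if labels[i] == labels[j] else -1
            if rng.random() >= p:
                value = -value  # noisy constraint, probability 1 - p
            E[i][j] = E[j][i] = value
    return E


def observed_fraction(E):
    """Fraction of off-diagonal entries of E that carry a constraint."""
    n = len(E)
    observed = sum(1 for i in range(n) for j in range(n)
                   if i != j and E[i][j] != 0)
    return observed / (n * (n - 1))
```

With r = 1 and p = 1 the matrix encodes the true partition exactly; lowering r thins out the observed pairs and lowering p corrupts a growing share of them.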

5.1 Simulated Data 

We evaluate on 6 different patterns using p = 1 (i.e., no noise) with a sampling rate of 
r = 0.01. The results for all patterns are shown in Figure 2. Each algorithm was run 5 
times, and we display the strongest result from these 5 runs. 

5.2 UCI Datasets 

We evaluate on five datasets (footnote 6), experimenting with different settings. In the first setting, 
we vary the percentage of constraints sampled, r, which we choose from {0.01,0.03,0.05}. 
Secondly, we add noise, letting the parameter p take on values in {1.0,0.95,0.90,0.80}. 
Lastly, we investigate the sensitivity of the algorithms to deviations from the true number 
of clusters, k, where we choose the deviation from the set {±3, ±2, ±1, 0}. For each dataset 
and set of hyper-parameters in this section, we average the results of five trials to produce 
the final result. 
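
The averaging protocol described above can be sketched as a small grid loop. The callable `run_trial` is hypothetical and stands in for running one algorithm on one dataset and returning a score such as the F-measure; the grid values are those listed in this section, and the use of the trial index as a random seed is an assumption.

```python
from itertools import product
from statistics import mean

R_VALUES = (0.01, 0.03, 0.05)      # constraint sampling rates
P_VALUES = (1.0, 0.95, 0.90, 0.80)  # 1 - p is the noise probability


def average_over_grid(run_trial, r_values=R_VALUES, p_values=P_VALUES,
                      n_trials=5):
    """Average a score over trials per (r, p) setting and over the grid.

    run_trial(r, p, seed) is a hypothetical callable returning one score.
    """
    per_setting = {}
    for r, p in product(r_values, p_values):
        scores = [run_trial(r, p, seed) for seed in range(n_trials)]
        per_setting[(r, p)] = mean(scores)
    overall = mean(list(per_setting.values()))
    return per_setting, overall
```

The per-setting dictionary corresponds to entries broken down by p and r, while the overall value corresponds to a single number averaged over both parameters.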

Average over all parameters: We present the performance of the algorithms, per 
dataset, averaged over different values of the parameters p and r. The results are summarized 
in Table 1 and show that overall, RDP-means has the best performance. 

Average over all parameters, varying the amount of noise: To analyze how adding noise 
to the constraints affects performance, we vary the values of p for each algorithm on each 
dataset. As mentioned previously, the probability of choosing noisy constraints is proportional 
to 1 − p. The higher the value of p, the less noise in the constraints. The results as 


6. The data is directly downloaded from http://archive.ics.uci.edu/ml/ 

Method \ Param.          p = 1                  p = 0.95               p = 0.9                p = 0.8
                         F     AdjRnd  NMI      F     AdjRnd  NMI      F     AdjRnd  NMI      F     AdjRnd  NMI
K-means                  0.61  0.44    0.53     0.60  0.43    0.52     0.61  0.43    0.53     0.61  0.43    0.52
DP-means                 0.58  0.39    0.49     0.59  0.39    0.48     0.59  0.40    0.49     0.58  0.38    0.48
TVClust (variational)    0.77  0.69    0.74     0.75  0.66    0.72     0.73  0.63    0.69     0.68  0.56    0.64
RDP-means                0.93  0.90    0.89     0.92  0.89    0.87     0.87  0.82    0.79     0.75  0.65    0.62
MPCKMeans                0.94  0.91    0.90     0.46  0.14    0.17     0.44  0.10    0.12     0.41  0.04    0.07
LCVQE                    0.83  0.76    0.79     0.64  0.48    0.53     0.58  0.39    0.45     0.50  0.27    0.35

Table 2: This table illustrates how the algorithms perform under different levels of noise. 
We average the results over each UCI dataset and values of r. 


a function of noise are summarized in Table 2. We average over different constraint sizes 
r ∈ {0.01, 0.03, 0.05} and different datasets. 

First, note that the results for K-means and DP-means are the same across different noise 
rates (footnote 7), since these algorithms do not make use of constraints. Second, 
MPCKMeans has the best performance for p = 1, but its performance drops sharply already at 
p = 0.95 (F-measure from 0.94 to 0.46 in Table 2). This method is therefore a good option 
only when the side information is nearly noise-free. The other methods, RDP-means, TVClust 
and LCVQE, also degrade as the noise level increases, although the drops for TVClust and 
RDP-means are smaller. 

Average over all parameters, varying the amount of side information: To better understand 
the effect of side information, we unroll the results of Table 2 and show the performance 
as a function of r. The results are shown in Table 3. 

Unsurprisingly, adding constraints (increasing r) increases the performance of those 
algorithms that make use of them. Interestingly, for p = 0.8 and r = 0.01, the best 
algorithms that do make use of constraints have similar performance to K-means and 
DP-means (which do not use constraints). This suggests there is room for improvement 
in handling noisy constraints. 

Effect of deviation from true number of clusters: Most of the algorithms we 
analyze are dependent on the true number of clusters, which is usually unknown in practice. 
Here, we investigate the sensitivity of those algorithms to perturbations in the true value 
of k. The DP-means algorithm of Jiang et al. (2012) is said to be less sensitive to the 
choice of k, since its parameter has weaker dependence on the choice of k. Similarly, since 
RDP-means is derived from DP-means, it is expected that it too would be relatively robust 
to deviations from the actual k. 

For all algorithms and for each dataset, we set p = 1 and r = 0.03 and vary the number 
of clusters to k − deviation, where deviation ∈ {±3, ±2, ±1, 0} (footnote 8). The results are shown 
in Figure 3. The x-axis shows the value of deviation and the y-axis shows the value of the 
F-measure. 

Notice that the performance of DP-means is clearly stable for different choices of k, which 
supports the claim made in Jiang et al. (2012). Similarly, RDP-means and TVClust show 
very stable results. MPCKMeans generally works well unless k is underestimated. 


7. Small variations in the results of K-means are possible due to the random initialization of each run. 

8. For some datasets where k = 3, we dropped the value deviation = 3. Also, the implementation of 
LCVQE that we used needs at least 2 clusters to work. We are not aware of a more general available 
implementation of this algorithm. 

Method \ Param.          p = 1                  p = 0.95               p = 0.9                p = 0.8
                         F     AdjRnd  NMI      F     AdjRnd  NMI      F     AdjRnd  NMI      F     AdjRnd  NMI

r = 0.01
K-means                  0.62  0.45    0.53     0.62  0.45    0.53     0.61  0.44    0.53     0.60  0.43    0.52
DP-means                 0.58  0.39    0.48     0.59  0.39    0.48     0.58  0.38    0.48     0.58  0.38    0.48
TVClust (variational)    0.72  0.62    0.69     0.69  0.57    0.65     0.66  0.52    0.60     0.58  0.43    0.53
RDP-means                0.84  0.77    0.76     0.79  0.71    0.68     0.69  0.58    0.57     0.56  0.39    0.41
MPCKMeans                0.83  0.76    0.73     0.52  0.23    0.29     0.49  0.16    0.21     0.44  0.07    0.12
LCVQE                    0.73  0.62    0.66     0.69  0.56    0.59     0.62  0.46    0.49     0.52  0.32    0.36

r = 0.03
K-means                  0.60  0.43    0.52     0.60  0.42    0.52     0.61  0.44    0.53     0.61  0.45    0.53
DP-means                 0.59  0.39    0.49     0.58  0.39    0.49     0.59  0.39    0.48     0.58  0.38    0.47
TVClust (variational)    0.78  0.70    0.75     0.77  0.69    0.74     0.76  0.68    0.73     0.70  0.59    0.67
RDP-means                0.98  0.98    0.96     0.98  0.97    0.94     0.93  0.90    0.86     0.77  0.69    0.63
MPCKMeans                0.99  0.99    0.97     0.43  0.07    0.10     0.41  0.04    0.06     0.40  0.02    0.04
LCVQE                    0.86  0.81    0.85     0.62  0.47    0.52     0.57  0.38    0.45     0.48  0.25    0.33

r = 0.05
K-means                  0.62  0.45    0.54     0.59  0.42    0.52     0.60  0.43    0.52     0.60  0.42    0.51
DP-means                 0.58  0.39    0.49     0.59  0.39    0.48     0.60  0.42    0.50     0.58  0.38    0.48
TVClust (variational)    0.81  0.74    0.79     0.79  0.72    0.77     0.78  0.70    0.75     0.74  0.66    0.72
RDP-means                0.96  0.96    0.95     0.99  0.99    0.98     0.98  0.97    0.94     0.91  0.87    0.82
MPCKMeans                1.00  1.00    0.99     0.44  0.12    0.13     0.41  0.08    0.09     0.38  0.03    0.05
LCVQE                    0.88  0.84    0.86     0.60  0.42    0.47     0.55  0.34    0.41     0.50  0.25    0.34

Table 3: This table is an expanded version of Table 2 and shows how the algorithms perform 
under different levels of noise and for each constraint sampling rate, averaged over each UCI 
dataset. 


5.3 ImageNet Clustering 


We repeat the same experiment from Jiang et al. (2012), where 100 images from 10 different 
categories of the ImageNet data were sampled (footnote 9). Each image was processed via a standard 
visual-bag-of-words pipeline, where SIFT was applied to image patches and the resulting SIFT 
vectors were mapped into 1000 visual words. The SIFT feature counts were then used as 
features for that image, and since these features are discrete counts, they were modeled 
as if coming from a multinomial distribution. Thus we used the corresponding divergence 
measure, i.e., KL-divergence (as opposed to Euclidean distance in the Gaussian case), as the 
distance metric in the clustering. 


We use Laplace smoothing (footnote 10) with a smoothing parameter of 0.3 to remove the 
ill-conditioning (division by zero inside the KL divergence). We also include the clustering 
results using a Gaussian model, to show the importance of choosing the appropriate 
distribution. The results of the evaluation are in Table 4. RDP-means has the best result, 
as it makes use of the side information. 
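
As an illustration of why smoothing is needed, the sketch below computes a KL divergence between two count vectors after additive smoothing. It is a minimal example under assumptions of our own: the same smoothing parameter is applied to both the instance and the cluster center, the divergence is taken from the instance to the center, and the function names are hypothetical.

```python
import math


def smooth(counts, alpha=0.3):
    """Additive (Laplace) smoothing: turn counts into strictly positive
    probabilities that sum to one."""
    total = sum(counts) + alpha * len(counts)
    return [(c + alpha) / total for c in counts]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def smoothed_kl(counts_x, counts_center, alpha=0.3):
    """KL divergence between smoothed count vectors (direction assumed)."""
    return kl_divergence(smooth(counts_x, alpha), smooth(counts_center, alpha))
```

Without smoothing, any visual word that appears in an image but has zero mass in the cluster center would make the ratio inside the logarithm undefined; with a positive smoothing parameter every term stays finite.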


We also investigate the behavior of RDP-means as a function of the percentage of pairs 
sampled. The result is depicted in Figure 4. When the rate is close to zero, 
the model is equivalent to DP-means. The figure shows that, as we add more constraints, 
the performance of the model consistently increases. When we sample only 6% of the pairs, 
we are able to almost fully reconstruct the true clustering. 


9. The set of images from each cluster and the extracted SIFT features are available at http://image-net.org 

10. See http://en.wikipedia.org/wiki/Additive_smoothing 

[Figure 3 panels: Ecoli, Balance, Glass, Wine; x-axis: deviation] 

Figure 3: Comparison of clustering quality on each UCI dataset with deviations from 
the actual number of clusters. The x-axis shows deviation, where the number of clusters 
declared to each algorithm is k − deviation. The y-axis shows the F-measure evaluation of 
each clustering result. 



             DP-Means                K-Means                 RDP-Means
             Gaussian  Multinomial   Gaussian  Multinomial   Gaussian  Multinomial
F-measure    0.18      0.22          0.18      0.25          0.20      0.44

Table 4: Results of clustering on the ImageNet dataset. For each measure, the result is averaged 
over 10 runs. 

[Figure 4 x-axis: rate of pairwise constraints added to the constraint matrix E] 

Figure 4: Effect of adding side information on the performance of RDP-means. As more 
information is added, performance improves. 